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Abstract 

As the volume of data grows, astronomers are increasingly faced with choices on what data to keep — and what to throw away. Recent work 
evaluating the JPEG2000 (ISO/IEC 15444) standards as a future data format standard in astronomy has shown promising results on observational 
data. However, there is still a need to evaluate its potential on other type of astronomical data, such as from numerical simulations. GERLUMPH 
(the GPU-Enabled High Resolution cosmological MicroLensing parameter survey) represents an example of a data intensive project in theoretical 
astrophysics. In the next phase of processing, the « 27 terabyte GERLUMPH dataset is set to grow by a factor of 100 — well beyond the 
current storage capabilities of the supercomputing facility on which it resides. In order to minimise bandwidth usage, file transfer time, and 
storage space, this work evaluates several data compression techniques. Specifically, we investigate off-the-shelf and custom lossless compression 
algorithms as well as the lossy JPEG2000 compression format. Results of lossless compression algorithms on GERLUMPH data products show 
small compression ratios (1.35:1 to 4.69:1 of input file size) varying with the nature of the input data. Our results suggest that JPEG2000 could 
be suitable for other numerical datasets stored as gridded data or volumetric data. When approaching lossy data compression, one should keep 
in mind the intended purposes of the data to be compressed, and evaluate the effect of the loss on future analysis. In our case study, lossy 
compression and a high compression ratio do not significantly compromise the intended use of the data for constraining quasar source profiles 
from cosmological microlensing. 
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1. Introduction 

1.1. The Petascale Astronomy Era 

The expression Big Data is currently a popular one in as¬ 
tronomy. It is also broadly and liberally used. Recognising that 
“bigness” is an arbitrary concept related to the usual data sizes 
and volumes within a sub-field of astronomy, we can consider 
three types of Big Data. The first case happens for projects deal¬ 
ing with lots of small to medium sized files. This is typical of 
many modern observational survey programs (e.g. |Ahn et alT] 
2012). Each image is only a few hundred megabytes (MB) 
in size, but many thousands of individual images are recorded 
and archived. The second case is when there are a few very 
hig files. Here belong cases such as the Millennium simulation 
( [Springel et al.||2005| . The full particle data was stored at 64 
time steps, each of size 300 gigabytes (GB), giving a raw data 
volume of nearly 20 terabytes (TB). Finally, the third case is 
when there are lots of very big files. For example, this is the 
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situation that will occur with the large spectral cubes from in¬ 
struments such as MeerKAT ( [Booth et al.| [2009| and ASKAP 
(Guzman & Humph reys! |2010| ). Similarly, the Square Kilome¬ 
ter Array is expected to gather 14 e xabytes of data and store 


IBM 


2013 


about one petabyte (PB0) every day (Lazio 2013 
IQuinn et aL||2015| ). 

With this growth of data volume in science come new chal¬ 
lenges. How to efficiently store such data in data storage facili¬ 
ties? How to transmit such data over a network in an acceptable 
time interval through limited bandwidth? How to perform visu¬ 
alisation and analysis on large samples of files? 

Visualisation and analysis of terabyte-scale data is already 
a challenge for existing astronomical softwares and the current 
work practices of astronomers (Hassan & Fluke} 2011 1 ) . Fur¬ 
thermore, such large data volumes cannot be processed, stored 
or viewed as a whole on desktop computers, even taking into 
account projected advances for hard drives and network tech¬ 
nologies ( [Kitaeff et al.[ |2012| . Research in a variety of sub¬ 
disciplines of astronomy is underway to solve such issues (e.g. 
|Anderson et al.[ |2Q11| |Broekema et al.[ |2012[ |Hassan et al.[ 
[2Ql2l[2Ql3] ). 


1 1 petabyte = 10 15 bytes. 
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As the volume of data grows, astronomers are increasingly 
faced with choices on what data to keep — and what to throw 
away. In this work, we concentrate on the questions regarding 
storage and transmission by investigating data compression. 

1.2. Data Compression 


compression affects analysis (e.g. White & Percival 1994||Shan ir 


Data Compression emerged from the work of Shannon (1948) 
on Information Theory. One of the critical ideas of Shannon’s 
theory was the relation between the probability of occurrence 
of a particular value v in a data source S of length n and the 
amount of information i that this value carried. If v occurs f{v) 
times, the probability p(y) of observing v in S is given by 

t \ 70) m 

p(v) = -. (1) 

n 

The amount of information associated with v is given by 
1 


i(v) = log = - log p(y). 
P(v) 


( 2 ) 


Averaging over all the information included in S such that 


H(S) = -n ^ p(y) log p(v) 


(3) 


V—1 


is the entropy of the source (often called Shannon entropy), 
a concept closely linked to Ludwig Boltzmann’s second law of 
thermodynamics which stipulates that the entropy of a system 
can only increase ( jFrank| |[2002| ). It is a lower bound on the 
space required to represent a source of information perfectly 
( [Shannon! 11948f ). 

Compression is achieved by using an encoder to remodel 
the original data in a compressed manner, bringing it close to or 
even beyond its entropy. Later, a decoder is used to uncompress 
the compressed data back to its original form. We can note that 
the concept of entropy naturally divides data compression into 
two main paradigms: lossless compression , where the output 
data of the decoder is identical to the encoder’s input data, and 
lossy compression , where the output of the decoder is different 


to the encoder’s input (Salomon} 2007) — it is an approxima¬ 
tion of the input. 

1.3. Big Data, data format, and data compression 

The idea of minimizing the volume of astronomical data 
using data compression can be traced back to the 1970s (e.g. 


niques have been used and developed ever 

since (e.g. Landau 

& Ghigo 1984; Andrianov 1984, Cafforio et al., 1985, Press 

1992 

White & Percival, 1994; Veran & Wright 1994, Starck 

et al. 

1995; Vasilyev, 1998; Gaudet et al. 

2000; Pence et al. 

2000 

Fixsen et al., 2000 [ Pence et al. ,2011 

Seaman! 2011). 


In the context of the petascale astronomy era, prior work 
considered questions such as: which data format/model should 


be used to enable work with Big Data (e.g. Kitaeff et al. 
Kitaeff etakl|2014| this issue; |Natuschl|2014!|Mink et~ 


2012 


2014 


Price et al.| 2014|); what bandwidth to keep in case of lossy 
compression (Price-Whelan & Hogg 2010); and whether lossy 


|& Nemiroff] [2005] IPence et~aH |20 1 Qj | Vohl| |2Q 13 1 IPeters & Ki 

|taeffl|20l4l ). 

Recent work by Kitaeffe^aL ( 2012|>, |Kitaeff et al.| ( |2014 


this issue), and jPeters & Kitaeff |20l4j investigates whether the 
JPEG2000 (ISO/IEC 15444) standards could be adopted more 
generally within astronomy. Within the JPEG2000 specifica¬ 
tion are features attractive to astrophysics such as: progressive 
transmission; the ability to decode only part of the data with¬ 
out having to load it all into memory (useful when dealing with 
larger-than-memory files); and the possibility to include cus¬ 
tomized metadata. 

In the context of radio astronomy, [Peters & Kitaeff] ( |2014[ ) 
and |Kitaeff et al.| ( |2014[ this issue) showed that the data format 
prescribed by the JPEG2000 standard can deliver high com- 
pressi on ratios with limited e rror on analysis of observational 
dat^] |Peters & Kitaeff ( 2014] ) evaluated the effect of the lossy 
compression on analysis. To do so, they ran a source finder on a 
spectral-imaging data-cube pre and post compression and com¬ 
pared the soundness of the outcome. They showed that using 
the JPEG2000 format, the strongest sources (> 2000 mJy km/s 
and higher) may still be retrievable at extremely high compres¬ 
sion ratio; in such cases, the compressed file would be more 
than 15,000 times smaller than the original file. In cases where 
they used a high quantization step during the compression (giv¬ 
ing them compression ratio of about < 90:1), JPEG2000 en¬ 
abled them to identify low integrated flux sources (less than 800 
mJy km/s). They concluded that the compression is denoising 
the cube, allowing sources previously obscured by noise to be 
identified. 

These results are promising for the observational commu¬ 
nity. However, to our knowledge, there is not yet any systematic 
work on other types of astrophysical data using the JPEG2000 
standard, such as data from numerical simulations. Observa¬ 
tional data is noisy by nature (e.g. the instrument capturing 
the data is imperfect). However, for numerical data, one would 
expect that lossy data compression will tend to introduce ad¬ 
ditional noise. It may therefore be considered an inherently 
negative aspect of such format for theorists who do not want 
to corrupt the data that has been carefully generated. But what 
actually is the effect of such compression on theoretical data? 
To investigate this question, we selected the Graphics Process¬ 
ing Unit-Enabled High Resolution cosmological MicroLensing 
parameter survey (GERLUMPH) theoretical dataset ( [Yemardos] 
|et al.|[2Ql4| ) as a case study. 

1.4. Case study: GERLUMPH 

GERLUMPH represents a recent example of a data inten¬ 
sive project in theoretical astrophysics. It has been designed to 
study quasar microlensing: the gravitational lensing effect of 
stellar mass objects within foreground galaxies that lie along 
the line of sight to multiply-imaged background quasars. There 


2 Peters & Kitaeff] {2014) actually used synthetic spectral-imaging data- 
cubes for their experiment allowing them to have a complete knowledge about 
the sources in the cube. 
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are currently « 90 known multiply imaged quasars, but an in¬ 
crease to a few thousands is anticipated to happen soon due to 
the commencement of synoptic all-sky surveys ([Oguri & Mar¬ 


shall 2010). GERLUMPH has been developed to explore and 


understand the quasar microlensing parameter space (defined 
in terms of the external shear, y, and the convergence, k) by 
providing a theoretical resource in preparation for these future 
discoveries. 

Quasar microlensing is usually studied numerically using 
magnification maps: pixelated versions of the caustic pattern 
in the background source plane created by the foreground mi¬ 
crolenses. 

GERLUMPH accelerated the process of generating cosmo¬ 
logical magnification maps by using graphic processing units 
(GPU; see Thomps on et al.||2010| ) resulting in the generation 
of - 70,000 maps in a short period of time (Vernardos et al.| 


2014). This work was completed on the GPU Supercomputer 


for Theoretical Astrophysics Research (gSTAR) national facil¬ 
ity hosted at Swinburne University of Technology. Each map is 
composed of 10,000 2 -pixel requiring « 400 MB per map. This 
corresponds to a manageable « 27 TB of data for the parameter 
survey. 

Observational properties of microlensed quasars can be ob¬ 
tained by performing a convolution between a quasar model 
and a magnification map. The next stage of GERLUMPH pro¬ 
cessing will involve convolution of 100 different physically- 
motivated quasar profiles with all of the GERLUMPH maps. 
It will produce a dataset two orders of magnitude larger, which 
would exceed the current available GERLUMPH storage space 
on gSTAR (currently « 3.4 PB for all projects). While it is de¬ 
sirable to save the convolved maps for future use — can they 
actually be stored? 


7.5. Magnification maps 

To understand the opportunities for compression, we need 
to understand the properties of the data. GERLUMPH gener¬ 
ates the magnification maps using a technique called Inverse 
Ray- Shooting ([Kayser et al.| |1986[ |Schneider & Weiss] |1986 


1987[ Thompso n et al. 2010| ). The technique proceeds by shoot¬ 
ing a large number of light rays (« 10 9 ) from the observer 
through the lens plane, where they are deflected by A* indi¬ 
vidual microlenses (compact, point-mass objects), defined as 


A* = 


k*A 

Kin)' 


(4) 


Here k * is the convergence caused by compact objects, (M) 
is the mean mass of the microlenses, and A is the area where 
they are distributed. The total convergence in cosmological 
microlensing has contributions from compact objects, k *, and 
smooth matter, k s , such that: 


K = K S + K *. (5) 

The gravitational lens equation, used in the context of cos¬ 
mological microlensing by A* microlenses of mass m in the 
presence of a smooth matter distribution and an external shear, 
y, is defined as: 


y = 


i -y 

o 


o 

i -y 


- K S X~ 


iv* 

£ 

i=1 




(X - Xj) 
\x - Xi \ 2 ' 


( 6 ) 


The quantities x and y are two-dimensional vectors in the 
lens and source plane respectively, jq are the location of the 
A* individual lenses with mass m E Each projected light ray 
from the lens plane to the source place is accumulated on a 
grid: the magnification map. The characteristic microlensing 
scale length in the source plane is the Einstein radius defined 
as: 


Re in - 


\D os D ls 4G{M) 


D n 


(7) 


where D oE D os , and Di s , are the angular diameter distances 
from observer to lens, observer to source, and lens to source re¬ 
spectively, (M) is the mean mass of pointmass microlenses, G 
is the gravitational constant, and c is the speed of light. A typi¬ 
cal range of values for R E i n can be obtained from the sample of 
87 lensed quasars compiled by |Mosquera & Kochanek|p011| : 
5.11 ± 1.88 x 10 16 cm. In this paper, we use the mean from 
the [Mosquera & Kochanek| ( |2011| ) sample as a typical value 
for R Ein . Therefore, the pixel size of the high-resolution GER¬ 
LUMPH maps corresponds to « 1.28 x 10 14 cm. 

The number of light rays that reach a pixel of the source 
plane Ay at coordinate i, /, compared to the number of rays 
that would reach each pixel if no microlensing was occurring, 
N avg , gives the local magnification value, //y: 


Nr 


R'iJ - 


IJ 


N 


avg 


( 8 ) 


In this work, we use two types of maps. A point-source 
magnification map (hereafter original map) is represented as an 
array of integers where each point is the ray count (Ay) at a 
given pixel in the map. A magnification map convolved with 
a quasar profile (hereafter convolved map) is a matrix of floats 
representing a magnification factor (pij) at a particular pixel. 
An example of such maps is shown in Figure [I] 

Convolved maps are slightly smaller than original maps. To 
avoid introducing distortion from the periodic nature of the con¬ 
volution when applied to our non-periodic maps, a frame is re¬ 
moved from the map after the convolution process. The width 
of each cropped section is half of the quasar profile width, i.e. a 
10,000 2 -pixel original map convolved with a 500 2 -pixel quasar 
profile will become a 9500 2 -pixel map. 


2. Lossless data compression experiment 

Starting with the idea that data corruption is an undesirable 
outcome, we first evaluate lossless compression. We compare 
results from popular lossless compression algorithms with lossy 
compression. We evaluated four lossless compression software 
commonly found on most linux distributions, as well as two 
other lossless compression solutions (LZ4 and bitpacking) in 
order to evaluate how well such techniques compress the GER¬ 
LUMPH data. 
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Figure 1: Examples of GERLUMPH magnification maps. Each image is a close-up portion of 1000 2 -pixel, equivalent to 2.5 Rpm- 
The upper left map is an original map. Upper right, lower left and lower right quadrant show the original map convolved with a 
10 2 -pixel, 100 2 -pixel, and 500 2 -pixel quasar profiles respectively. The quasar profile widths correspond to 0.1%, 1% and 5% of the 
map width. 


2.7. Evaluated softwares 

Gzip (version 1.3.12) is part of the family of compression 
techniques called “dictionary methods”, in which a dictionary 
of substrings is constructed to represent the data. Gzip im- 


plements the Deflate algorithn^ 

j a combination of the classic 

LZ77 algorithm (|Ziv & Lempel, 

1977) and static Huffman cod- 

ing (Huffman 

1952). The main idea behind LZ77 is to reduce 


the redundancy of the data by replacing values, or preferably 
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3 http ://tools .ietf. org/html/rfc 1951 
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a sequence of values, with metadata referencing to a dictionary 
of recurrent sequences of values seen previously by the encoder. 
Encoding sequences of values as blocks enables the encoder to 
take advantage of mutual information in neighbouring values. 
Huffman coding is an entropy coder based on uniquely decod- 
able, variable length, prefix codes that associates shorter codes 
for frequent values and longer codes for rare ones. 

bzip2 (version 1.0.5) is a mixture of three techniques ( |Sa-[ 
lomon||2007|). The first tech nique is the Burrows-Wheeler trans¬ 


form ( [Burrows & Wheeler [ |1994[ ), a reversible transform used 
as pre-processing for compression. It sorts blocks of data, sim¬ 
plifying the structure of input data so as to enable more effective 
compression. The second is the move-to-front (Bentley et al 


|1986| technique, a locally adaptative method, adapting to the 
frequencies of symbols in local areas of the data. Finally, the 
last technique is the Huffman coding described above. 

LZMA (version 4.999.9beta) stands for Lempel-Ziv-Markov 
chain-Algorithm. It is also a LZ77 variant based on a large 
search buffer, a hash function that generates indexes, and 2 
search methods: the fast method, which uses a hash-array of 
lists of indexes and the normal method, which uses a hash-array 
of decision trees ( [Salomon) |20071 ). xz (version 4.999.9beta) is 
a multi-functionality software that incorporates the LZMA algo¬ 
rithm. It not only enables compression of a single file, but also 
enables archiving functionality (many files into one compressed 
file). We evaluated this software to compare if the implementa¬ 
tion differs from the standalone LZMA. 

LZ4 (version rlOl) is a recent popular and really fast variant 
of LZ77, designed for parallel computing. The design principle 
behind this software is simplicity delivered through a simple 
code that delivers a fast execution ( |LZ4[|20lT] ). It generally de¬ 
livers lower compression ratios than the previously mentioned 
techniques, but is rather meant to have a much smaller runtime, 
which can be of interest when one needs to compress a great 
number of files. 

The bitpacking implementat ion w as evaluated following 
a discussion in Vernardos & Fluke ( 2014) stating that disk space 
could be saved using less than 32 bits to encode integers, as not 
all maps require 32-bit to represent the necessary raw informa¬ 
tion. Instead of defining our own integer types and therefore 
dealing with the computation overhead, we investigated how 
much space can be saved by “bit-packing” into 32-bit buffers 
the magnification maps for which less than 32-bit is required, as 
described in Femire & Boytsov| ( |2014| . The software evaluates 
the number of bits required to represent all values in a magnifi¬ 
cation map as a function of b = riog 2 (max)"|, where max is the 
maximum value in the file. If b < 32, it packs nxb into 
32-bit buffers, where n is the number of values to encode. 


2.2. Methodology 

For each lossless technique mentioned in the previous sec¬ 
tion, we obtained the compression ratio (#:1) where 


4 We modified the fast C++ implementation of Daniel Lemire 
( http://pastebin.com/ugGnkOOp I to work with our data. More information can 
be found at http://lemire.me/blog/archives/2012/03/06/how-fast-is-bitpacking/ 


SlZe or iginal 

#=~ -, (9) 

Size compressed 

and the median time in seconds (s) of three rounds of compres¬ 
sion and decompression. We selected 306 maps uniformly dis¬ 
tributed over the k , y parameter space covered by GERFUMPH- 
GD1. We used two different types of maps: original maps 
and maps convolved with 3 different quasar profile widths: 10- 
pixel, 100-pixel, and 500-pixel corresponding to 0.1%, 1% and 
5% of the map width respectively (Figure [T]). This resulted in 
306 x 4 = 1224 maps per compression technique. 

The different data types (integer, float) have an effect on 
general purpose lossless compression algorithms. By evaluat¬ 
ing compression over parameter space, it enables us to evalu¬ 
ate variations in compression ratio and runtime relative to map 
types. 

All techniques tested in this section, except bitpacking, 
includes flags which can be set to obtain different compression 
range. The flags let the user choose between lower to higher 
compression ratio. Fower compression ratio is faster at runtime 
while higher compression takes longer to run. We tested these 
extreme values for all techniques. All lossless experiments were 
conducted using Finux (CentOS release 6.6) on SGI C2110G- 
RP5 nodes containing 2 eight-core SandyBridge processors at 
2.2 GHz, where each processor is 64-bit 95W Intel Xeon E5- 
2660, with 10 GB of RAM allocated. 

2.3. Results 

Table [T] shows the mean results for median compression 
time ( ctime ), decompression time (< dtime ), and compression ra¬ 
tio (ratio) for each map type (original and convolved maps) we 
used from GERFUMPH. The global compression ratio results 
are plotted in Figure [2] 

A first observation we can make is that all techniques per¬ 
form better on arrays of integers than on arrays of floats. We 
also see that as we convolve with the larger quasar profile, the 
compression ratio for the convolved maps (array of floats) in¬ 
creases but is generally much lower than for the original maps 
(array of integers). Overall the compression and decompression 
times decrease as the size of the quasar profile is increased. 

From the table and figures, we can draw several conclu¬ 
sions about the lossless compression software we evaluated. 
First of all, LZMA and xz provide the best compression ratio on 
average for convolved maps (floats), and bzip2 delivered the 
best compression ratio on average for original maps (integers). 
These three techniques are slow relative to the other techniques 
tested, xz is slightly slower than LZMA. LZ4 and bitpacking 
deliver lower compression ratio than their evaluated rivals, but 
are much faster than their tested counterparts. There is a trade¬ 
off between speed and compression ratio; however, this gap is 
much smaller for bzip2. 

All of these compression techniques were faster at decom¬ 
pression than at compression, which is important for archiving 
purposes as one only needs to compress once, but may need to 
decompress many times. An important point to note is that LZ4 
only compressed the integers files (where the mean compres¬ 
sion ratio for convolved maps is FI). Therefore, even if it was 
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Table 1: List of GERLUMPH parameter space’s mean values: median compression time (ctime) in seconds, median decompression 
time (dtime) in seconds and compression ratio (ratio; #:1) for original and convolved maps. A plus (+) means that the technique 
was executed with the highest compression ratio flag (generally slower), while minus (-) indicates the lowest compression ratio flag 
(generally faster). Bit packing is a technique only applicable for original maps (integer values). Bold indicates best results for a 
given map type. 




bzip2 

Gzip 

LZ4 

LZMA 


xz 

bitpacking 



- 

+ 

- 

+ 

- 

+ 

- 

+ 

- 

+ 



ctime 

44.91 

44.27 

12.50 

507.44 

2.53 

33.16 

53.78 

245.78 

54.47 

247.50 

6.39 

Original map 

dtime 

15.05 

19.46 

6.34 

5.12 

1.67 

1.24 

16.06 

16.53 

16.91 

17.44 

6.12 


ratio 

3.94 

4.28 

2.63 

2.81 

1.39 

2.34 

3.71 

4.08 

3.71 

4.08 

2.00 


ctime 

85.59 

86.62 

25.86 

31.63 

1.06 

17.96 

86.05 

127.12 

86.84 

128.22 

— 

10 * 1 2 -pixel quasar profile 

dtime 

37.18 

38.79 

6.67 

6.27 

0.99 

0.99 

37.88 

38.26 

38.55 

39.04 

— 


ratio 

1.11 

1.13 

1.13 

1.14 

1.00 

1.00 

1.40 

1.47 

1.40 

1.47 

— 


ctime 

72.12 

74.34 

23.03 

27.34 

1.09 

16.98 

68.86 

106.22 

69.30 

107.03 

— 

100 2 -pixel quasar profile 

dtime 

33.35 

35.59 

5.94 

5.61 

1.05 

1.06 

26.79 

26.82 

27.35 

27.46 

— 


ratio 

1.11 

1.13 

1.14 

1.15 

1.00 

1.00 

1.60 

1.65 

1.60 

1.65 

— 


ctime 

65.55 

67.11 

20.43 

24.36 

0.94 

14.56 

60.44 

97.07 

60.88 

97.68 

— 

500 2 -pixel quasar profile 

dtime 

29.98 

32.09 

5.43 

5.15 

0.91 

0.91 

23.56 

23.69 

24.12 

24.28 

— 


ratio 

1.12 

1.14 

1.14 

1.15 

1.00 

1.00 

1.76 

1.78 

1.76 

1.78 

— 


technically the fastest technique, it was practically useless for 
convolved maps. 

Such lossless results would not enable the storage of all the 
expected data of the convolution phase of GERLUMPH. As dis¬ 
cussed in section [L4| the expected amount of generated data in 
the convolution phase is about 2.7 PB. Given the results for 
convolved maps, such lossless compression would only com¬ 
pression to the order of the petabyte, still exceeding the cur¬ 
rent available storage space for GERLUMPH of « 50 TB on 
gSTAR. 


3. Lossy data compression experiment 

The lossy JPEG2000 standard has been evaluated in con¬ 
sideration of its potential as a future astronomy standard. As 
an extension of the work of Kitaeff et al. ( 2012), Peters & Kita- 
|eff| ( |2014| ), and |Kitaeff et al.| ( |2014| this issue), we evaluate the 
lossy JPEG2000 format for original and convolved magnifica¬ 
tion maps. 


3.1. Evaluated software 

KERLUMP10 the GERLUMPH extension of the Kakadu Soft¬ 
ware Development Kit (KDU, version 7.4Q has been devel¬ 
oped to take advantage of the image-like maps of GERLUMPH. 


It converts a binary file into a compressed jp2 file (Adams 


2002). An interesting feature of the jp2 format is the availabil¬ 


ity of inner customizable metadata about the map. KERLUMPH 
enables lossy compression for both integer and real type of 
values, and to go back from .jp2 to .bin in order to enable 
work with raw-like data. A low level description of the stan¬ 
dards is beyond the scope of this paper, instead we refer the 


5 Available at http://supercomputing. swin.edu.au/projects/kerlumph/ 

6 http ://kakadusoftware .com 


reader to Taubman ( 2005]) and [Li] ( |2003| ) for explanations on 
the JPEG2000 standard and its related mathematics. 

A KERLUMPH user needs to supply the input file, the output 
file name (jp2 file), the dimensions of the input map, and the 
data type (32-bit integer or 32-bit float). It is also possible to 
supply other JPEG2000 parameters such as those tested in this 
paper (Qstep and Clevels) in order to overwrite the default pa¬ 
rameter value. The application then evaluates the range of the 
input data, converts the data to floats (required for integers val¬ 
ues), and scales it to the [-0.5, 0.5] range required for the lossy 
JPEG2000 compression. The original range and the data type 
are saved into the jp2 file header. Decompression inverts the 
compression process and returns a binary file with the original 
data type and range. 


3.2. Methodology 


The JPEG2000 parameter space can be overwhelming if ev¬ 
ery combination of compression parameters was to be tested. 
Table [2] shows the list of parameters taken from the core pro¬ 
cessing system, Part 1 of the JPEG2000 standard ([ISO/IEC 1544|4 
1 1:20001 [ 20001 ). 


Given the k, y parameter space of GERLUMPH that we are 
covering, testing all possible parameters and combinations of 
compression parameters would be extremely time consuming. 
Instead, as we are working with data that approximates grey 
scale still-image maps, we selected 2 parameters dictated by 
part 1 of the JPEG2000 standard (highlighted in Table[2]). These 
parameters are: 


1. The coefficient quantization step size (Qstep), which dis- 
cretises the wavelet coefficient values. It enables a trade¬ 
off between compressed image quality and encoding effi¬ 
ciency ( [Clark||2008] ). 


2. The number of levels (Clevels) in the Discrete Wavelet 
Transform tree, which influences the wavelet domain be¬ 
fore quantization and encoding. Increasing the number of 
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IQ 


Convolved maps (small) 



Convolved maps (large) 



Figure 2: Box and whiskers plots showing compression ratio statistics for 306 original maps (array of integers) and 306 maps 
convolved with 10 2 (small), 100 2 (medium) and 500 2 (large)-pixel quasar profiles (array of floats), for different lossless compression 
software. The median (red line) is within the box bounded by the first and third quartiles range (IQR = Q3-Q1). The whiskers are 
Ql-1.5 x IQR and Q3 + 1.5 x IQR. Beyond the whiskers, values are considered outliers and are plotted as crosses. Bitpacking 
is only applicable on integer data. Therefore, there is no result relative to this technique for the convolved maps. For all other 
techniques, a plus (+) means that the technique was executed with the highest compression ratio flag (generally slower), while 
minus (-) used the lowest compression ratio flag (generally faster). 


DWT levels let examine the lower frequencies at increas¬ 
ingly finer resolution, packing more energy into fewer 
wavelet coefficients and leading to the expectation that 
compression performance improves as the number of lev¬ 
els increases ( |Clark[|2008| ). 


[Peters & Kitaeff | ( [201 4| ) showed that the code block size and 
precincts size had no effect on both compression and soundness 
of the spectral cube data. Therefore, we have bypassed these 
parameters for this evaluation. Another difference in our ap¬ 
proach is that we have not used Clevels by itself with all other 
parameters set to default as in Peters & Kita eff ( 2014 ). Instead, 


we have combined Qstep with Clevels , with the hypothesis that 
compression performance would improve as the wavelet de¬ 
composition levels increase for a similar quantization step size. 
The explored parameter space is displayed in Table [3] 

One can observe that our choices for the Qstep parameter 
vary from those used in Peters & Kitaeff ( 2014 ), which started 
at 10 -6 and ended at 0.01. As our maps showed different order 
of details depending on the type (original or convolved), and 
depending on the quasar profile size used in the convolution 
phase, we used a broader range to let us find optimal values for 
our different map type. Our search for optimal compression is 
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Table 2: Parameters in Part 1 of the JPEG2000 Standard, or¬ 
dered as encountered in the encoder. The two parameters we 
investigated are highlighted. 



Parameter 

1 . 

Reconstructed image bit depth 

2. 

Tile size 

3. 

Color space 

4. 

Reversible or irreversible transform 

5. 

Number of wavelet transform levels 

6. 

Precinct size 

7. 

Code-block size 

8 . 

Coefficient quantization step size 

9. 

Perceptual weights 

10. 

Block coding parameters: 

(a) Magnitude refinement coding method 

(b) MQ code termination method 

11. 

Progression order 

12. 

Number of quality layers 

13. 

Region of interest coding method 


Table 3: JPEG2000 parameter space explored for each GER- 
LUMPH map. 


Parameter 

Default 

Start 

End 

Step 

Qstep 

1/256 

10- 7 

* 2 

xS/2 

Clevels 

5 

5 

32 

+27 


depicted in Algorithm [I] 

All lossy experiments were conducted using Linux (CentOS 
release 6.6) on SGI C2110G-RP5 nodes containing 2 eight-core 
SandyBridge processors at 2.2 GHz, where each processor is 
64-bit 95W Intel Xeon E5-2660, with 4 GB of RAM allocated. 


3.2.1. Criteria for evaluating lossy compression 

Since we used JPEG2000 as a lossy compression technique, 
we have to evaluate its effect on future analysis. The microlens- 
ing community uses two main analysis techniques: the single¬ 
epoch imaging technique ( [Bate et alT 2008[ Floyd et al.[|2009| 
and the light curve technique (Kochanek 2004). 

The single-epoch imaging technique uses the Magnifica¬ 
tion Probability Distribution (MPD) of the map. The MPD is a 
binned, normalized histogram of the count of each magnitude 
value, mag , in a map (Figure [3]): 


mag = log 


10 


/2-th 


( 10 ) 


with the local magnification value p t j (Equation^ for pixel 
i, j, and the theoretical magnification p l / 1 for the macromodel: 


1 


rth = t : -75-2- 01) 

(1 - k) l - y l 

The light curve technique consists of extracting many light 
curves from the magnification map. A light curve is defined as 


Algorithm 1: Find optima for a given map 
Data: Map , Qsteps (sorted array based on Table[3]) 
Result: Optimal results for Clevels e {5,32} 

begin 

foreach Clevels e {5,32} do 

Initialise ratio best , rmse best , q step best , previous m 
l <— 0, r <— \Qsteps\ - 1 

while / < r do 



ratio , rmse <— CompressAndCompare(mop, 
Qsteps[m ], Clevels) 
if previous m ^ m then 
if rmse <0.01 then 
ratiofast <— ratio 
rmsebest < — rmse 
qstepbest «— Qsteps[m\ 
if m > l then 
r <— m - 1 

else 

]_ l <— m + 1 
previous m <— m 

else 

\ break 

Save(ratiObest , rmseb es u q s tepbest> Clevels) 



log 

Figure 3: The single-epoch imaging technique uses the MPD 
(right), an histogram of the count of each magnitude value in 
a magnification map (left) normalised by the total number of 
pixels. The map in this Figure is taken at k = 0.52, y = 0.36. 
Source: gerlumph.swin.edu.au. 


a one dimensional array of neighbouring pixels. The value of 
each pixel of the light curve corresponds to the relative magni¬ 
tude A mag, where 


Amag = 2.5 log 10 — (12) 

/2-th 

for a given pixel at position i, j in the magnification map. 
For each light curve extracted, the minimum, maximum, mean, 
and standard deviation of relative magnitude are calculated. In 
this work, we extracted light curves of IREin in length and sam¬ 
pling length of 0.0025R^/„. Using the values for R Ein and ef- 
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fective source velocity from Kochanek & Schechter| ( [2004) for 
Q2237+0305, such light curve would be « 15 years long, sam¬ 
pled every ^13 days. 

To evaluate the effect of compression, we have evaluated 
the root-mean-square-error (RMSE) to quantify the difference 
between the map before and after compression. The RMSE is 
defined as : 


sion time (s) of three runs of compression are shown in Figure 
[9] All compression and decompression was executed in under 
10 seconds per map, while having median and mean time under 
3 seconds (Table [5]) on gSTAR. 


4. Discussion 


n m 

RMSE= A — V Y (Oij - Cij ) 2 , 
\ nm 

y 1=1 7=1 


(13) 


where n is the length of the map in pixel, m is the width of 
the map in pixel, Oij is the relative magnitude value (Equation 


12) of pixel at position i, j in the original map and C t j is the 


relative magnitude value at pixel i, j in the compressed distri¬ 
bution. 

To minimize the effect of compression on the analysis tech¬ 
niques described above, we have set a threshold on RMSE in 
order to obtain a near-lossless compression. As a guideline for 
RMSE, Poindexter et al. (2007) report a precision of 0.01, while 
Bate et al. ( 2008| ) report a precision in magnitude of 0.1. In this 


work, we consider an optimal near-lossless compression ratio 
such that RMS E <0.01. 


3.3. Results 

Table [4] compares the minimum, maximum, mean and stan¬ 
dard deviation in compression ratios obtained at RMS E < 0.01 
using Clevel e {5, 32} for the different map types (original and 
convolved). We also show the individual compression ratios 
against the GERLUMPH /c, y parameter space as obtained us¬ 
ing Clevel e {5, 32} in Figures [4] and [5] respectively. 

Only one of the original maps did not deliver a RMS E < 
0.01 in the range of Qstep evaluated (white square in the Fig¬ 
ures at k - 0.05, y = 1.0). We did not broaden our search 
range as the lowest compression obtained within the range was 
smaller (^ 3:1) than the best lossless compression presented in 
Section [2] 

In Figures [6] and [7j we show the results of extracting two 
distinct light curves from an original map and a map convolved 
with a 100 2 -pixel quasar profile. Each light curve is depicted 
using a different colour. The figures show a close up of 2000 2 - 
pixel region for both the map before (lower left panel) and after 
compression (lower right panel). We extracted light curves at 
the same coordinate in both maps and evaluated the residual 
pixel by pixel. The residual of each light curves is shown in the 
top panel of each figure. As expected from using the RMSE as 
a threshold for the whole map, most subsampled pixels in the 
residuals are < 0.01. 

Similarly, we plot the MPD for the same maps in Figure [8] 
The MPD of the original map stayed intact while showing lim¬ 
ited differences for the map convolved with a 100 2 -pixel quasar 
profile. Most of the differences are found in demagnification, 
which would not have much consequences on analysis as an 
image that is demagnified would not likely be noticed anyway. 

An appealing feature of building KERLUMPH from the KDU 
is the speed delivered through its multi-threaded processing abil¬ 
ity. The median compression time (s) and median decompres¬ 


As opposed to the lossless compression results which showed 
low compression ratios (<3:1 with a mean of 1.78:1 with LZMA 
for the maps convolved with the largest quasar profile), the 
near-lossless compression results suggest that a high level of 
compression (up to 325,812:1, with mean of 76,452:1) can be 
applied to the convolved magnification maps with minimal im¬ 
pact on their future scientific use. Convolving maps with a 
quasar profile eliminates lots of fine details in the map — the 
larger the quasar profile, the more fine details are eliminated 
(Figure |T]). As mentioned in Section |T2| increasing the Clevels 
parameter lets JPEG2000 decompose the lower frequencies in 
the map at increasingly finer resolution. Convolved maps showed 
to be good candidates for such a technique. 

On the other hand, even if the near-lossless compression 
offered higher compression ratios than the lossless compression 
softwares we investigated (4.28:1 on average with bzip2), the 
compression ratio is still low (at most 10:1, with mean ratio 
of 7.70:1). If compression ratio is the most important issue, 
one should note that these results are higher than their lossless 
counterparts with limited error introduction. 

As the original maps are the seeds for all the convolved 
maps, it is debatable that lossy compression should be applied 
to it, especially with the gain in compression ratio it provides. 
However, if it is indeed the path that is chosen, one can as¬ 
sume that minimal error will be introduced. Using the mean 
compression ratios, the lossless compression of original maps 
using bzip2 would shrink the storage requirement from « 27 
TB to « 6.3 TB, while using near-lossless compression using 
JPEG2000 would shrink it to « 3.5 TB. 

Given that we estimate the total amount of raw storage space 
required for the convolved maps to be two order of magni¬ 
tude larger than the one of the original maps (« 2.7 PB), the 
difference in compression ratio from lossless and near-lossless 
are very noticeable. If we do a rough estimate by using the 
mean compression ratio of all quasar profiles offered by LZMA 
(« 1.6 : 1), it would only shrink the storage requirement to 
« 1.7 PB. Such compression ratio would not enable us to store 
all of the expected data from the next phase of GERLUMPH. 
However, if we do the same evaluation for the near-lossless 
compression mean ratio (^29,993.4:1), we would use only « 90 
GB for all convolved maps. Such storage space is already avail¬ 
able within our resources. Also, as the compressed files are 
much smaller in size than the original, its transmission requires 
less time and therefore saves bandwidth usage. 

Our case study results suggest that JPEG2000 could be suit¬ 
able for other numerical datasets. In particular, well suited can¬ 
didate are likely to be data structure such as gridded data or 
volumetric data. When approaching lossy data compression, 
one should keep in mind the intended purposes of the data to be 
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Table 4: Comparison of compression ratio (#:1) obtained using Clevels e {5,32}, where RMSE < 0.01, for all different GER- 
LUMPH map types. Table shows the minimum, maximum, mean and standard deviation of all 306 maps per map type. 



Minimum 

Maximum 

Mean 

Standard deviation 

Clevels 

5 

32 

5 

32 

5 

32 

5 

32 

Original map 

5.14 

5.14 

10.00 

10.01 

7.70 

7.70 

0.72 

0.72 

10 2 -pixel quasar profile 

48.55 

48.88 

8,429.24 

30,121.51 

420.30 

732.24 

947.65 

3,017.25 

100 2 -pixel quasar profile 

1870.86 

2,529.88 

26,822.66 

167,252.56 

4,625.85 

12,795.24 

2,251.80 

20,281.67 

500 2 -pixel quasar profile 

5,185.66 

37,258.75 

29,931.18 

325,812.27 

7,165.02 

76,452.76 

2,544.40 

42,885.24 


Table 5: List of GERLUMPH parameter space’s mean and median time: median compression time (ctime) in seconds, median 
decompression time (dtime) in seconds and compression ratio (ratio; #: 1) for original maps and maps convolved with small, medium 
and large quasar profiles. 



Original 

Small 

Medium 

Large 


ctime 

dtime 

ctime 

dtime 

ctime 

dtime 

ctime 

dtime 

Distribution median 

2.58 

2.57 

1.94 

1.93 

1.78 

1.75 

1.56 

1.56 

Distribution mean 

2.88 

2.85 

2.21 

2.23 

2.14 

2.18 

1.85 

1.89 


compressed, and evaluate the effect of the loss on future anal¬ 
ysis. In our case study, results suggest that such compression 
will still enable us to do our science. 


ratios would enable us to comfortably compress 2.7 PB to less 
than a TB without corrupting the future science cases for which 
these data are meant for. 


5. Conclusion 


6. Aknowledgements 


We investigated a question relative to the petascale astron¬ 
omy era: can data compression allow us to store large numerical 
simulation datasets? We used the GERLUMPH dataset com¬ 
prising « 27 TB of microlensing magnification maps as a case 
study. In the next phase of processing, each of the 70,000 GER¬ 
LUMPH maps will be convolved with (9(100) different quasar 
profiles. The dataset is expected to grow to « 2.7 PB, way be¬ 
yond the current storage space capacity of the project. 

We presented a comparative data compression study be¬ 
tween several all-purpose (and one data-type specific) lossless 
compression software. We showed that such compression soft¬ 
ware delivered in best case a mean compression ratios of 4.28:1 
for the original maps, and 1.47:1, 1.65:1 and 1.78:1 for maps 
convolved with 10 2 -pixel, 100 2 -pixel and 500 2 -pixel quasar pro¬ 
files respectively. Such ratios do not provide a significant reduc¬ 
tion in the required storage size. 

We also presented a follow-up study to that of 
|taeff| ( |2Q14| ) and |Kitaeff et al.| ( |2014[ this issue) 
gated the JPEG2000 standard as a future astronomy data for¬ 
mat standard. To do so, we evaluated its suitability for data 
generated in numerical simulation using near-lossless compres¬ 
sion. We compared results of the two main techniques of quasar 
microlensing, the single-epoch imaging technique and the light 
curve technique, pre and post compression. We showed that 
JPEG2000 can deliver higher compression ratios (mean ratio 
of 7.7:1) for original maps than its lossless counterparts while 
keeping a low level of introduced errors. We also showed that 
for convolved maps, the compression ratios are much higher 
than in our lossless experiment (up to 325,812:1). 

This study shows that JPEG2000 can be suitable for our 
case of simulated astronomical data. Such high compression 
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thanks to Kakadu Software Ltd. and Dr. David Taubman for 
letting us use your KDU freely for this research. This work 
was funded by the Fonds de recherche du Quebec — Nature et 
technologies (FRQNT) and Swinburne Research. 
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Figure 4: Highest compression ratios obtained for RMS E < 0.01 using Qstep and Clevel = 5, plotted against the GERLUMPH 
parameter space (the shear y and the convergence k). Each square represents a single magnification map. Quadrants show the 
original maps (upper left), and the maps convolved with a 10 2 -pixel (upper right), 100 2 -pixel (lower left), and 500 2 -pixel (lower 
right) quasar profile. 
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Figure 5: Highest compression ratios obtained for RMS E < 0.01 using Qstep and Clevel = 32, plotted against the GERLUMPH 
parameter space (the shear y and the convergence k). Each square represents a single magnification map. Quadrants show the 
original maps (upper left), and the maps convolved with a 10 2 -pixel (upper right), 100 2 -pixel (lower left), and 500 2 -pixel (lower 
right) quasar profile. 
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Figure 6: Residual of two light curves of 1 R ein in length, extracted from an original map taken at k = 0.25, y = 0.5. Top panel 
shows the residual of two light curves (blue and red) extracted before and after compression. The same light curves are extracted 
from the map before (bottom left) and after (bottom right) compression (8:1) in order to evaluate the residual, at the locations 
depicted by the red and blue lines. The two maps in the bottom panel are 2000 2 -pixel close ups. 
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Figure 7: Residual of two light curves of 1 R ein in length, extracted from a map convolved with a 100 2 -pixel quasar profile (k = 
0.95,7 = 0.0). Top panel shows the two residual light curves (blue and red). The same light curves are extracted from the map 
before (bottom left) and after (bottom right) compression (17,271:1) in order to evaluate the residual, at the locations depicted by 
the red and blue lines. The two maps in the bottom panel are 2000 2 -pixel close ups. 
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Figure 8: Residual of two MPDs from an original map (left, k = 0.25, y = 0.5) and a map convolved with a 100 2 -pixel quasar profile 
(right, k = 0.95,7 = 0.0). These are the same maps used in Figures [6] and [7] respectively. Top panels show the two overlapping 
MPDs evaluated before and after compression (8:1 and 17,271:1 respectively) and bottom panels show the residual. 
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Figure 9: Median compression time (left) and median decompression time (right) of three runs of compression obtained for optimal 
results shown in figure [ 5 ] Maps convolved with 10 2 , 100 2 and 500 2 pixels quasar profile (array of floats) are marked with labels 
small , medium and large respectively, and original maps (array of integers) are labeled original. 
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